Counterfactual Vision-and-Language Navigation: Unravelling the Unseen
The task of vision-and-language navigation (VLN) requires an agent to follow text instructions to find its way through simulated household environments. A prominent challenge is to train an agent capable of generalising to new environments at test time, rather than one that simply memorises trajectories and visual details observed during training. We propose a new learning strategy that learns from both observed environments and generated counterfactual environments. We describe an effective algorithm to generate counterfactual observations for VLN on the fly, as linear combinations of existing environments. Simultaneously, our novel training objective encourages the agent's actions to remain stable between original and counterfactual environments, effectively removing the spurious features that otherwise bias the agent. Our experiments show that this technique provides significant improvements in generalisation on benchmarks for Room-to-Room navigation and Embodied Question Answering.
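To make the "linear combinations of existing environments" idea concrete, here is a minimal sketch in PyTorch. It assumes precomputed panoramic visual features; the names `make_counterfactual` and `alpha` are illustrative, not taken from the paper.

```python
import torch

def make_counterfactual(original_feats: torch.Tensor,
                        partner_feats: torch.Tensor,
                        alpha: torch.Tensor) -> torch.Tensor:
    """Mix visual features from an original and a similar environment.

    original_feats, partner_feats: (T, D) visual features for T views.
    alpha: (T, 1) mixing weights in [0, 1] (the exogenous variables);
        alpha = 0 recovers the original observation exactly.
    """
    return (1.0 - alpha) * original_feats + alpha * partner_feats

# Example: mix 10% of a similar environment into every view.
feats_a = torch.randn(36, 2048)   # e.g. 36 panoramic views, 2048-d features
feats_b = torch.randn(36, 2048)
cf = make_counterfactual(feats_a, feats_b, torch.full((36, 1), 0.1))
```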
Review for NeurIPS paper: Counterfactual Vision-and-Language Navigation: Unravelling the Unseen
Summary and Contributions: This paper introduces a method for generating *counterfactual* visual features to augment the training of vision-and-language navigation (VLN) models, which predict a sequence of actions to carry out a natural language instruction while conditioning on a sequence of visual inputs. Counterfactual training examples are produced by perturbing the visual features of an original training example with a linear combination of visual features from a similar training example. The weights of the linear combination (the exogenous variables) are optimized to jointly minimize the edit to the original features and maximize the probability that a separate speaker (instruction generation) model assigns to the true instruction conditioned on the resulting counterfactual features, subject to the constraint that the counterfactual features change the interpretation model's predicted action at every timestep. Once these counterfactual features are produced, the model is trained to assign equal probability to the actions of the original example when conditioning on the original and the counterfactual features (in imitation learning), or to obtain equal reward (in reinforcement learning). The method improves performance on unseen environments on the R2R benchmark for VLN, and also shows improvements on embodied question answering.
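As a rough illustration of the two objectives the summary describes (optimizing the mixing weights, then enforcing action consistency), the following hedged PyTorch sketch may help. `speaker_log_prob` and `policy` are assumed interfaces rather than the authors' actual code, the hard constraint that predictions change is omitted for brevity, and the squared-difference penalty is just one way to encode "equal probability".

```python
import torch

def mixing_weight_loss(alpha, original_feats, partner_feats,
                       speaker_log_prob, instruction, lam=1.0):
    """Loss minimized over the mixing weights `alpha`: keep the edit to
    the original features small while keeping the true instruction
    likely under a separate speaker model."""
    cf_feats = (1.0 - alpha) * original_feats + alpha * partner_feats
    edit_cost = (cf_feats - original_feats).pow(2).mean()
    speaker_nll = -speaker_log_prob(cf_feats, instruction)
    return edit_cost + lam * speaker_nll

def consistency_loss(policy, instruction, original_feats, cf_feats, actions):
    """Encourage equal probability for the ground-truth actions under the
    original and counterfactual features (the imitation-learning case)."""
    logp_orig = policy(instruction, original_feats).log_softmax(dim=-1)
    logp_cf = policy(instruction, cf_feats).log_softmax(dim=-1)
    # Gather the log-probability of each ground-truth action and penalize
    # the squared difference between the two conditions.
    lp_o = logp_orig.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    lp_c = logp_cf.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return (lp_o - lp_c).pow(2).mean()
```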